Abstract

Due to notable discoveries in the fast-evolving field of complex networks, recent research
in software engineering has also focused on representing software systems with networks.
Previous work has observed that these networks follow scale-free degree distributions
and reveal small-world phenomena, whereas here we explore another property commonly
found in complex networks, i.e. community structure. We adopt class dependency networks,
where nodes represent software classes and edges represent dependencies among
them, and show that these networks reveal significant community structure, characterized
by properties similar to those observed in other complex networks. However, although
intuitive and anticipated by different phenomena, the identified communities do not exactly
correspond to software packages. We empirically confirm our observations on several
networks constructed from Java and various third-party libraries, and propose different
applications of community detection to software engineering.

Keywords: community structure, complex networks, software systems 
PACS: 89.75.Fb, 89.75.Hc, 89.20.Ff 



1. Introduction 

Analysis of complex real-world networks has led to some significant discoveries in
recent years. The research community has revealed several common properties of various
real-world networks, including different social, biological, Internet, software and other
networks. These properties provide an important insight into the function and structure
of general complex networks; moreover, they allow for better comprehension of the
underlying real-world systems and thus give prominent grounds for future research in a
wide variety of different fields.

In the field of software engineering, network analysis has only recently been adopted
to acquire better comprehension of complex software systems. Nowadays,
software represents one of the most diverse and sophisticated human-made systems;
however, only little is known about the actual structure and quantitative properties
of (large) software systems. Cai and Yin have denoted this dilemma as the software law
problem, which represents an effort towards identifying and formulating physics-like laws,
obeyed by (most) software systems, that could later be applied in practice. However, the
main objective of the software law problem is in investigating what software looks like.

In the context of employing complex network analysis, the research community has already
made several discoveries over the past years. In particular, different authors have
observed that networks constructed from various software systems follow scale-free
(i.e. power-law) degree distributions and reveal small-world [1] phenomena. We continue
their work by exploring another property commonly found in real-world networks, i.e.
community structure [3]. The term denotes the occurrence of local structural modules
(communities) that are groups of nodes densely connected within and only loosely connected
with the rest of the network. Communities play crucial roles in many real-world
systems; however, the community structure of complex software system networks
has not yet been thoroughly investigated.

The main contributions of our work are as follows. We adopt class dependency networks,
where nodes represent software classes and edges represent dependencies among them,
and show that these networks reveal significant community structure, with properties
similar to those observed for other complex networks. We also note that the network
representing the core software library exhibits less significant community structure.
Furthermore, we show that, although intuitive and anticipated by different phenomena,
the revealed communities do not (completely) correspond to software packages. Finally, we
demonstrate how community detection can be employed to obtain highly modular software
packages that still relate to the original packaging.

The rest of the article is structured as follows. First, in section 2 we briefly present
relevant related work and emphasize the novelty of our research. Next, section 3 introduces
the employed class dependency networks. In section 4 we present an empirical evaluation
of the community structure of class dependency networks, and propose possible applications
to software engineering. Last, in section 5 we give final conclusions and identify areas
of further research.



2. Related work 



Although the software law problem has already been investigated for over 30 years, the
research community has only recently begun to employ network analysis to gain better
comprehension of software systems. As mentioned above, different authors
have observed that networks constructed from software systems follow scale-free degree
distributions and exhibit the small-world property. Software networks
thus reveal common behavior, similar to that observed in other complex networks.
Furthermore, authors have also identified several different phenomena (e.g. software
optimization) that might govern such complex behavior. Moreover, analysis
of clustering [1] has revealed hierarchical structure in software networks.

On the other hand, the community structure of software networks has not yet been
investigated. Several authors have already discussed the notion of communities in the context
of software systems; however, to our knowledge no general empirical analysis or formal
discussion has been conducted. Still, authors have observed different
phenomena that could promote the emergence of community structure in software
networks and discussed possible applications within software engineering and other
sciences [6, 7].

Figure 1: Class dependency network for the JUNG graph and network framework [24]. Node colors indicate
four high-level packages of the framework: visualization (green), algorithms (red), graph (orange)
and io (blue). The network reveals rather clear community structure that roughly coincides with the
software packages.



In the wider context of software network analysis, different random-walk based measures
have been proposed to identify key (i.e. most influential) classes and packages [22, 9].
Researchers have also investigated connectedness, robustness, motifs
and patterns within software networks [21]. Just recently, software systems were also
treated as evolving complex networks [8].



3. Class dependency networks 

Previous research on the analysis of software systems has employed a variety of different
types of software networks (i.e. graphs). These include package, class and method
collaboration graphs, subroutine call graphs [6], software architecture and software
mirror graphs [8], software architecture maps [13], inter-package dependency networks
and others. The networks primarily differ in whether they are
constructed from source code, byte code or software execution traces, and in the level
of software architecture they represent. However, as discussed in section 2, most of these
networks share some common characteristics.

For the purpose of this research we introduce class dependency networks (Fig. 1).
Here an object-oriented software system is represented by an undirected multi-graph $G(N, E)$,
where $N$ is the set of nodes and $E$ is the set of edges. Graph $G$ is constructed from
software source code in the following manner. Each software class $c$ is represented by a
node $n_c \in N$, while an edge $\{n_{c_1}, n_{c_2}\} \in E$ represents a dependency between classes $c_1$ and
$c_2$. Dependencies are of four types, namely, inheritance (class $c_2$ inherits or implements
class $c_1$), field ($c_1$ contains a field of type $c_2$), parameter ($c_1$ contains a method that takes
type $c_2$ as a parameter) and return ($c_1$ contains a method that returns type $c_2$).
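The construction can be illustrated with a minimal sketch. The input format (triples of two class names and a dependency type), the function name and the example class names are assumptions made for illustration only; they are not taken from the tooling used for the networks in this article. Parallel edges are kept as counts, which matches the multi-graph definition.

```python
from collections import Counter

# The four dependency types named in the text.
DEPENDENCY_TYPES = {"inheritance", "field", "parameter", "return"}


def build_class_dependency_network(dependencies):
    """Build an undirected multigraph from (class_1, class_2, type) triples.

    Returns (nodes, edges), where edges maps an unordered pair of classes
    to the number of parallel edges between them.
    """
    nodes = set()
    edges = Counter()
    for c1, c2, kind in dependencies:
        if kind not in DEPENDENCY_TYPES:
            raise ValueError(f"unknown dependency type: {kind}")
        if c1 == c2:
            continue  # assumption of this sketch: self-dependencies are ignored
        nodes.update((c1, c2))
        edges[frozenset((c1, c2))] += 1
    return nodes, edges


# Hypothetical dependencies, for illustration only.
deps = [
    ("SparseGraph", "Graph", "inheritance"),
    ("Layout", "Graph", "field"),
    ("Layout", "Graph", "parameter"),
    ("GraphReader", "Graph", "return"),
]
nodes, edges = build_class_dependency_network(deps)
```

Since only classes that take part in at least one dependency are added, isolated classes never enter the network in this sketch.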

Note that class dependency networks are constructed merely from the header information
of the classes, and their methods and fields. As this information is commonly
determined by a group of developers, prior to the actual software development, it is less
influenced by the subjective nature of each particular developer. Hence, the networks
more adequately represent the (intended) structure of a particular software system; still,
some relevant information might be discarded.

An example of a class dependency network is shown in Fig. 1. The network reveals
strong community structure; furthermore, the communities also roughly coincide with
the actual software packages. However, as will be shown in section 4, the modularity of the
natural communities, depicted in the network's topology, is much larger than that of the
packages, determined by the developers.

4. Empirical analysis and applications 

In the following sections we present and discuss the results of the empirical evaluation of
the community structure of class dependency networks (section 4.1), address the relation
between communities and software packages (section 4.2) and propose possible applications
of community detection to software engineering (section 4.3).

The empirical evaluation is done using 8 class dependency networks constructed^1 from
Java and several third-party libraries (Table 1). The networks range from those with
hundreds of nodes to those with several tens of thousands of edges (all isolated nodes
have been discarded). For generality, the networks were selected so that they represent a
relatively diverse set of software systems.



Table 1: Class dependency networks for different software systems (|P| is the number of packages).

Network    Description                                          |N|     |E|    |P|
junit      JUnit 4.8.1 (testing framework) [25]                 128     470     22
jmail      JavaMail 1.4.3 (mail and messaging framework) [26]   220     893     14
flamingo   Flamingo 4.1 (GUI component suite) [27]              251     846     16
jung       JUNG 2.0.1 (graph and network framework) [24]        422    1730     39
colt       Colt 1.2.0 (scientific computing library) [28]       520    3691     19
org        Java 1.6.0 (org namespace) [29]                      716    7895     47
javax      Java 1.6.0 (javax namespace) [29]                   2581   22370    110
java       Java 1.6.0 (java namespace) [29]                    2378   34858     54



To reveal the community structure of each network we employ three community detection
algorithms: a divisive algorithm based on edge betweenness, a
greedy agglomerative optimization of modularity (see below) [30, 31] and a fast partitional
algorithm based on label propagation. The algorithms are denoted EB, MO
and LP respectively; their detailed description is omitted. It should be noted
that our objective is not to compare the algorithms, but rather to compare the revealed
communities, and thus address their stability.

^1 Networks were constructed by parsing JAR archives provided by the developers. However, due to
various issues, some of the software classes might have been discarded.

The community structure identified by the algorithms is assessed using modularity
$Q$ [33], which measures the significance of communities with respect to a selected null model. Let
$l_i$ be the community (label) of node $n_i \in N$ and let $A_{ij}$ denote the number of edges
incident to nodes $n_i, n_j \in N$. Furthermore, let $P_{ij}$ be the expected number of incident
edges for $n_i, n_j$ in the null model. The modularity then reads

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - P_{ij} \right) \delta(l_i, l_j), \tag{1}$$

where $m$ is the number of edges, $m = |E|$, and $\delta$ is the Kronecker delta. The modularity
thus measures the difference between the number of intra-community edges
and the expected number of such edges in the null model, expressed as a fraction of all
edges ($Q \in [-1, 1]$). Higher values represent stronger community structure. Commonly a
random graph with the same degree distribution as the original is selected for the null
model. Hence, $P_{ij} = \frac{k_i k_j}{2m}$, where
$k_i$ is the degree of node $n_i \in N$. It should be noted that modularity has some known
deficiencies, e.g. the resolution limit; however, it is still widely adopted for the analysis
of network community structure.
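As an illustration of the definition, the following sketch computes modularity for the null model with $P_{ij} = k_i k_j / (2m)$. It uses the equivalent per-community form $Q = \sum_c \left[ e_c / m - (d_c / 2m)^2 \right]$, where $e_c$ is the number of edges inside community $c$ and $d_c$ is the sum of degrees of its nodes; this form is valid for graphs without self-loops. The function name and the input format (a list of node pairs, with repeated pairs standing for parallel edges) are assumptions of this sketch.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Modularity Q of a partition of an undirected (multi)graph.

    edges:  iterable of (u, v) pairs; repeated pairs are parallel edges.
    labels: dict mapping every node to its community label.
    Assumes that the graph has no self-loops.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    intra = defaultdict(int)   # e_c: edges with both endpoints in community c
    degree = defaultdict(int)  # d_c: sum of node degrees in community c
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single edge, split into the two triangles.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(toy_edges, toy_labels)  # 2 * (3/7 - (7/14)**2) = 5/14
```

Placing all nodes into a single community gives $Q = 0$ under this null model, which is why only values well above zero indicate community structure.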

Furthermore, the identified community structure is also compared to the actual software
packages. Let $\mathcal{C}$ be the partition (i.e. communities) revealed by some algorithm
and $\mathcal{P}$ the partition that represents software packages (the corresponding random
variables are $L$ and $P$ respectively). We compare the partitions by computing their normalized
mutual information $NMI$ ($NMI \in [0, 1]$). Hence,

$$NMI = \frac{2 I(L, P)}{H(L) + H(P)},$$

where $I(L, P)$ is the mutual information of the partitions, $I(L, P) = H(L) - H(L|P)$,
and $H(L)$, $H(P)$ and $H(L|P)$ are standard and conditional entropies. $NMI$ of identical
partitions equals 1, and is 0 for independent partitions.
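A minimal sketch of this measure is given next. It assumes that both partitions are supplied as equally long sequences of labels over the same nodes, and it uses the identity $H(L|P) = H(L, P) - H(P)$ to obtain the mutual information from three entropies. The function names, and the convention of returning 1 when both partitions consist of a single group, are choices of this sketch.

```python
from collections import Counter
from math import log


def entropy(counts, n):
    """Shannon entropy of a distribution given as absolute counts."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def nmi(labels_l, labels_p):
    """Normalized mutual information 2 I(L, P) / (H(L) + H(P))."""
    n = len(labels_l)
    if n == 0 or n != len(labels_p):
        raise ValueError("expected two non-empty label sequences of equal length")
    h_l = entropy(Counter(labels_l).values(), n)
    h_p = entropy(Counter(labels_p).values(), n)
    h_joint = entropy(Counter(zip(labels_l, labels_p)).values(), n)
    mutual = h_l + h_p - h_joint  # I(L, P) = H(L) - H(L|P)
    if h_l + h_p == 0:
        return 1.0  # convention of this sketch: two single-group partitions agree
    return 2 * mutual / (h_l + h_p)


same = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])     # identical up to renaming
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # statistically independent
```

The two calls reproduce the boundary cases stated in the text: relabelled but identical partitions score 1, independent ones score 0 (up to floating-point error).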

4.1. Community structure of class dependency networks

Mean modularities obtained with the three community detection algorithms for the selected
set of class dependency networks (section 4) can be seen in Table 2. For all
networks except java, the algorithms managed to reveal community structures with
particularly high values of modularity, i.e. roughly between 0.55 and 0.75 on average, where
values above 0.30 are commonly regarded as an indication of (significant) community
structure [32, 36, 37, 38]. The networks thus reveal much stronger community structure than
expected in a random network with the same degree distribution. Note also that all of
the algorithms obtain high modularities for all of these networks. This indicates
rather stable communities, strongly depicted in the networks' topologies.

In the case of the java network, observe that the average degree is considerably larger than
for the other networks (Table 1). Hence, the network is extremely dense and the communities
are thus only loosely defined. Consequently, the algorithms attain considerably lower
modularities (Table 2); however, as the network represents the core of the Java programming
language, it is expected to convey a less modular structure.
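The remark on average degree can be checked directly from the node and edge counts in Table 1, since the average degree of an undirected graph is $2|E| / |N|$. The short sketch below only restates those published counts; the variable and function names are illustrative.

```python
# (|N|, |E|) for each network, copied from Table 1.
NETWORKS = {
    "junit": (128, 470),
    "jmail": (220, 893),
    "flamingo": (251, 846),
    "jung": (422, 1730),
    "colt": (520, 3691),
    "org": (716, 7895),
    "javax": (2581, 22370),
    "java": (2378, 34858),
}


def average_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: every edge adds two to the degree sum."""
    return 2 * n_edges / n_nodes


avg = {name: average_degree(n, e) for name, (n, e) in NETWORKS.items()}
densest = max(avg, key=avg.get)
```

With these counts, java has an average degree of about 29.3, followed by org (about 22.1) and javax (about 17.3), while the remaining networks stay below 15.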

Table 2: Mean modularities $Q$ obtained for class dependency networks of different software systems.
Values were computed from 100 iterations (10 iterations for the EB algorithm); missing values (-) could
not be obtained due to limited time resources. Modularities of the natural community structures,
depicted in the networks' topologies (i.e. extracted by the algorithms), are much larger than those of
the actual software packages.

Network    EB       MO       LP       P+        P
junit      0.5587   0.5759   0.5542    0.1140    0.0893
jmail      0.5607   0.5972   0.5401    0.2350    0.2086
flamingo   0.6466   0.6823   0.6485    0.2870    0.2511
jung       0.7210   0.7324   0.6874    0.3279    0.3212
colt       -        0.6025   0.5599   -0.0158   -0.0332
org        -        0.5599   0.5254    0.1847    0.1830
javax      -        0.7667   0.7422    0.3119    0.2907
java       -        0.4664   0.4132    0.2269    0.2206

In Fig. 2 we show the (cumulative) distributions of community sizes obtained with the LP
algorithm for the jung, javax and java networks. Interestingly, the distributions (roughly)
follow power-laws with exponents $\alpha$ around 2 (i.e. $P(s) \sim s^{-\alpha}$, where $s$ is the
community size). A scale-free distribution of community sizes is a common property,
observed also in other complex networks; furthermore, the values of $\alpha$ also coincide
with values found for other networks, where authors commonly report $\alpha$ between 1 and 3.
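For readers who wish to reproduce such a plot, the sketch below derives community sizes from a labelling and computes their empirical cumulative distribution. The text does not state which cumulative convention or fitting procedure was used for Fig. 2; the complementary form $P(S \geq s)$ is assumed here, and no exponent is fitted. Function names and the toy labelling are illustrative.

```python
from collections import Counter


def community_sizes(labels):
    """Sizes of all communities, given a dict mapping node -> community label."""
    return sorted(Counter(labels.values()).values())


def cumulative_distribution(sizes):
    """Return [(s, P(S >= s))] for every distinct size s, in increasing order."""
    n = len(sizes)
    counts = Counter(sizes)
    remaining = n  # number of communities with size >= current s
    result = []
    for s in sorted(counts):
        result.append((s, remaining / n))
        remaining -= counts[s]
    return result


# Toy labelling: one community of three nodes and two singletons.
toy = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2}
sizes = community_sizes(toy)
ccdf = cumulative_distribution(sizes)
```

Plotting the resulting pairs on doubly logarithmic axes would show a power-law as an approximately straight line.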

We conclude that class dependency networks contain significant community structure
that also reveals properties similar to those observed in other complex networks. Thus, besides
scale-free degree distributions and the small-world effect, software networks reflect another
common network phenomenon, i.e. community structure.

To further address the validity of our results, we briefly discuss different phenomena
that could promote the emergence of community structure in software networks.
Li et al. and Jenkins and Kirk [21] have discussed the influence of internal cohesion,
i.e. functional strength of the components, and external coupling, i.e. inter-dependencies
among components, on the structure of software systems (and networks). Highly modular
software should clearly demonstrate the minimum coupling-maximum cohesion principle [40],
which would naturally promote the emergence of strong structural modules within software
networks. The modularity of software networks thus reflects the modularity of the
underlying software systems.

Figure 2: Cumulative distribution functions of community sizes for the jung, javax and java networks.
The distributions revealed by the LP algorithm (roughly) follow power-laws with the exponents $\alpha$ shown
(i.e. $P(s) \sim s^{-\alpha}$, where $s$ is the community size); however, the distributions of package sizes are not
characterized by power-laws (but rather by, e.g., log-normal distributions).

Figure 3: Community network for the jung class dependency network (Fig. 1) revealed by the LP algorithm
(modularity equals $Q = 0.7062$). For each community we show the distribution of classes over software
packages (weakly represented packages are not shown), where colors indicate the four high-level packages of
the framework (see Fig. 1). Communities clearly distinguish between high-level packages, but they do
not completely coincide with the actual (bottom-most) packages.

Furthermore, Baxter et al. have emphasized that object-oriented software is commonly
developed according to the Lego hypothesis [41]. The hypothesis states that software
is constructed out of a large number of smaller components that are relatively independent
of each other. This phenomenon should be clearly reflected in software networks, where
components should emerge as network communities.

In summary, software networks have a strong natural tendency to form community
structure. In the case of class dependency networks, communities should, due to the
above discussion and by intuition, correspond to software packages (Fig. 3). This aspect
is thoroughly explored and discussed in the following section.

4.2. Relation of network communities to software packages

The analysis of the relation between network communities and software packages
reveals that packages are considerably different from communities. We first note that
packages do not feature connectedness in class dependency networks (exact results are
omitted). Connectedness is regarded as a basic property of communities and states that
communities should correspond to connected sets of nodes. In contrast, software
packages can consist of disconnected sets of nodes, which is an indicator of relatively
poor modular structure.

Table 3: Peak (maximum) $NMI$ between network communities, extracted by the algorithms, and software
packages P for different class dependency networks. Values were computed from 100 iterations
(10 iterations for the EB algorithm); missing values are marked with -. The results indicate relatively
poor correspondence between natural network communities and software packages.

Network    EB       MO       LP       P+
junit      0.6605   0.5823   0.6285   0.8412
jmail      0.5300   0.5248   0.5553   0.8379
flamingo   0.5686   0.5408   0.5590   0.7882
jung       0.6011   0.6094   0.6887   0.9187
colt       -        0.4784   0.5277   0.6507
org        -        0.5301   0.5385   0.9123
javax      -        0.6365   0.6826   0.8096
java       -        0.3453   0.3063   0.8386

Let P represent the actual (bottom-most) software packages and let P+ represent
packages that feature connectedness (i.e. disconnected packages are treated as several
different packages). Table 2 shows modularities of software packages for the analyzed
class dependency networks. The values are considerably lower than modularities of the
natural community structures, revealed in the networks' topologies (i.e. extracted by the
algorithms), and cannot be regarded as an indication of significant modular structure.
Moreover, in Fig. 2 we show the distributions of package sizes for the jung, javax and java
networks. The distributions are clearly not characterized by power-laws, as observed
in the case of communities (the distributions are, e.g., log-normal or stretched exponential,
which coincides with the observations in [20]). Last, we also (directly) compare the
packages with network communities by computing $NMI$ of the corresponding partitions
(Table 3). The results further confirm the above observations: software packages only weakly
relate to network communities and are not characterized by the same laws or properties.
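The step from packages P to connected packages P+ can be sketched as follows: every package is split into the connected components it induces in the network, and each component is treated as a package of its own. The adjacency format, the function name and the (package, index) labels are assumptions of this sketch rather than details given in the text.

```python
def split_into_connected_packages(adjacency, packages):
    """Split every package into connected parts (the P+ partition).

    adjacency: dict mapping node -> iterable of neighbouring nodes.
    packages:  dict mapping node -> package name (the P partition).
    Returns a dict mapping node -> (package name, component index).
    """
    refined = {}
    next_index = {}  # next unused component index for each package
    for start in adjacency:
        if start in refined:
            continue
        package = packages[start]
        index = next_index.get(package, 0)
        next_index[package] = index + 1
        refined[start] = (package, index)
        stack = [start]
        while stack:  # depth-first search restricted to the same package
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in refined and packages[neighbour] == package:
                    refined[neighbour] = (package, index)
                    stack.append(neighbour)
    return refined


# Path 0 - 1 - 2 - 3, where node 1 separates node 0 from the rest of package "a".
toy_adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_packages = {0: "a", 1: "b", 2: "a", 3: "a"}
connected = split_into_connected_packages(toy_adjacency, toy_packages)
```

In the toy example package "a" is disconnected, so it is split into two parts and the two original packages become three connected ones.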

We stress that the origin of the disparity between network communities and software
packages is not entirely evident. The lack of connectedness of software packages, and the low
values of modularity, suggest that class dependency networks give a poor representation of
software systems or disregard some relevant relations among classes (from the perspective
of software packages). However, the different distributions of sizes clearly show that there
is some additional discrepancy between the communities and software packages, which is
independent of the actual network representation (i.e. class dependencies).

Last, we discuss the particularly low value of modularity for the colt library packages
(Table 2). As the library represents a core framework for scientific computing, where
performance is often of greater importance than extensibility, maintenance and modular
structure, it is expected for the system to exhibit only poor modular structure. The
modularity of software packages thus reflects the modularity of the underlying software
system, which in fact motivates the application presented in the following section.

4.3. Applications of community detection to software engineering

Due to the weak modular structure of software packages, an obvious application of community
detection to software engineering is to reveal highly modular packaging of software
systems (Fig. 4). The choice of class dependencies (i.e. type of the network) is in that
case of course arbitrary. However, simply applying a community detection algorithm
to the employed networks would often prove useless, as the identified communities could
hardly be mapped to the existing software packages. Such a mapping is vital for the
comprehension of the results. A simple solution is to start with the communities that
represent the original software packages, and refine them using some community detection
algorithm. The algorithm should thus merely refine and merge the communities, and
no new communities (i.e. labels) should be introduced. This preserves the original software
packages, their hierarchy and identifiers, which enables complete comprehension of the
final results. An example can be seen in Fig. 4.

Figure 4: Community networks for the class dependency network representing classes within the cern.colt and
cern.jet packages of the colt library (reduced to the largest connected component). Networks correspond to
the original software packages P (left) and communities revealed with the LP algorithm by refining software
packages P (right). For each community we show the distribution of classes over software packages,
where colors indicate high-level packages of the framework. Refined communities (i.e. packages) obtain
significantly higher modularity and can still be related to the original packaging.
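The refinement idea can be sketched with a basic asynchronous label propagation that is seeded with package names instead of unique node labels. This is a minimal illustration under stated assumptions, not the exact procedure behind Fig. 4: the update rule (adopt the most frequent label among neighbours, keep the current label on ties), the random node order and the function name are choices of this sketch. Because a node can only adopt a label already carried by a neighbour, no new labels are ever introduced.

```python
import random
from collections import Counter


def refine_packages(adjacency, packages, seed=0, max_sweeps=100):
    """Refine a package partition by label propagation seeded with package names.

    adjacency: dict mapping node -> list of neighbours (repeat a neighbour
               to represent parallel edges).
    packages:  dict mapping node -> original package name.
    """
    rng = random.Random(seed)
    labels = dict(packages)
    order = list(adjacency)
    for _ in range(max_sweeps):
        rng.shuffle(order)
        changed = False
        for node in order:
            counts = Counter(labels[nb] for nb in adjacency[node])
            if not counts:
                continue  # isolated node keeps its package
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[node] not in candidates:
                labels[node] = rng.choice(candidates)
                changed = True
        if not changed:
            break
    return labels


# Two triangles joined by the edge 2 - 3; class 2 is packaged with the "wrong" triangle.
toy_adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
toy_packages = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b", 5: "b"}
refined = refine_packages(toy_adjacency, toy_packages)
```

In the toy example only class 2 is moved, from package "b" to package "a", so the refined packaging remains directly comparable with the original one.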

Another obvious application to software engineering is (network) abstraction. Community
detection can be employed to obtain a clear representation of software systems
on a relatively high level of abstraction. Furthermore, one can also address the centrality
[42, 43] (or other measures of influence) of the identified communities, to expose
key nodes and structures throughout the entire system. A simple application of
community detection to software abstraction can be seen in Fig. 5 (and Fig. 3).

The article represents early work in the area of applying network community detection
methods and techniques in software engineering. However, further work is needed
to design sophisticated applications that would be of considerable benefit in practice.



5. Conclusion 

The article explores the community structure of networks constructed from complex
software systems (i.e. class dependency networks). The main contribution is in showing
that software networks reveal significant community structure, characterized by properties
similar to those commonly observed for other complex networks. Software networks thus
reveal another general network phenomenon, besides scale-free degree distributions and
the small-world effect, which is a prominent observation for the software law problem.
Furthermore, the results are of even greater importance, as software represents one of the
most complex human-made systems.

Future work will mainly focus on considering other types of class dependency networks
that will include additional relations among classes. Moreover, we will introduce
the notions of positive and negative relations, to more adequately model similarity and
diversity among software classes. The main objective will be to establish further
understanding of the (community) structure of class dependency networks, and to assess its
relation to software packages. The results could thus promote various novel applications
in the software engineering domain.

Figure 5: Community network for the javax class dependency network revealed by the LP algorithm (only the
largest five connected components are shown; modularity equals $Q = 0.7318$). For each community we
show the distribution of classes over software packages, where colors indicate high-level packages of the
framework. The representation gives a clear insight into the structure of the javax namespace, and shows
relations (i.e. dependencies) among different sub-packages of the system.